In searching for library documents that match the content of a given sequence of query words, a set of 
equivalent words is defined for each query word, along with a corresponding word equivalence value assigned 
to each equivalent word. Target sequences of words in a library document which match the sequence of query 
words are located according to a set of matching criteria. The similarity value of each target sequence is 
evaluated as a function of the corresponding equivalence values of words included therein. Based upon the 
similarity values of its target sequences, a relevance factor is then obtained for each library document. 

TEXT SEARCH SYSTEM 



This invention relates to information retrieval systems. More particularly, this invention relates to a method 
and apparatus for locating library documents the content of which matches the content of a sequence of 
query words in accordance with a given set of matching criteria, as well as for evaluating the relevance of 
these documents. 

Recent advances in mass storage technology and database management techniques have increased the 
feasibility of storing vast numbers of documents in data processing systems, thereby providing an 
opportunity to use the processing power of these systems to facilitate the retrieval of the stored documents. 

Descriptions of representative prior art search systems can be found in C. Faloutsos, "Access Methods 
for Text", ACM Computing Surveys, vol. 17, no. 1, March 1985, pp. 49-74; D. Tsichritzis et al., "Message 
Files", ACM Trans. on Office Information Systems, vol. 1, no. 1, Jan. 1983, pp. 88-98; and G. Salton, "The 
SMART Retrieval System - Experiments in Automatic Document Processing", Prentice Hall, 1971. 

Search commands provided by most prior art text search facilities typically include a set of query 
words, together with some specifications defining their contextual relationships. A library document is 
retrieved if it contains words that are identical or equivalent to the query words and whose occurrences 
satisfy the specified relationships. In these prior art search facilities, equivalent words are usually 
given the same degree of significance (weight). The contextual relationships are, basically, defined only in 
terms of Boolean logic and adjacency operators. As a result, the flexibility provided by these facilities for 
expressing a desired search content is usually very limited, so that a given search may not be as accurate 
as one would desire. 

An object of this invention is to provide a text search facility that allows users to define more accurately 
and flexibly the scope within which documents with alternative expressions of a desired content will be 
retrieved, by allowing its users to assign different weights to equivalent words and to define structures of 
words which they consider acceptable. To further enhance flexibility, it is another 
object of this invention to provide a search facility that can evaluate, based upon user-provided criteria, a 
value to represent the relevance of a document. Moreover, since the relevance of a document depends on 
application, user and temporal factors, it is a further object of this invention to allow users to specify how 
relevance is to be measured, and also to provide a search facility whereby located documents can be 
ranked in accordance with their respective relevance. 

The invention disclosed herein is a method and an apparatus for retrieving, from among a library of 
more than one document, those that match the content of a sequence of query words. The method 
comprises the steps of: defining a set of equivalent words for each query word and assigning to each 
equivalent word a corresponding word equivalence value; locating target sequences of words in a library 
document that match the sequence of query words in accordance with a set of matching criteria; evaluating 
a similarity value for each of said target sequences of words as a function of the corresponding equivalence 
values of words included therein; and obtaining a relevance factor for the library document based upon the 
similarity values of its target sequences. 

According to the invention there is provided a method implemented in a data processing apparatus for 
retrieving from among more than one library document those matching the content of a sequence of query 
words, comprising the steps of: 

(a) defining a set of equivalent words for each of the query words and assigning a word equivalence 
value to each of said equivalent words; and 

(b) computing a relevance factor for a library document, comprising the steps of: 

(i) locating target sequences of words in the library document that match the sequence of query 
words, and equivalents thereof, according to a set of matching criteria; 

(ii) evaluating similarity values of said target sequences of words, each similarity value being 
evaluated as a function of the equivalence values of words included in the corresponding target sequence; 
and 

(iii) said relevance factor being computed as a function of the similarity values of its target 
sequences. 

In order that the invention may be fully understood, it will now be described by way of example with 
reference to the accompanying drawings, in which: 

Figs. 1a and 1b give an overview of the retrieval process according to the preferred embodiment of this 
invention. 






EP 0 304 191 A2 



DESCRIPTION OF THE PREFERRED EMBODIMENT 

Fig. 1a is a diagram giving an overview of the document retrieval system according to the present 
invention. The preferred embodiment is implemented in a data processing complex that comprises a 
general purpose central processing unit (CPU) connected to a plurality of secondary devices which include 
external storage devices and Input/Output equipment. An example of the CPU is the IBM System/360 or 
IBM System/370 described in U.S. Pat. No. 3,400,371 by G.M. Amdahl et al., entitled "Data Processing 
System", or in IBM System/370 Principles of Operation, IBM Publication GA22-7000-6. The complex 
operates under the control of an operating system, such as the IBM System/360 or System/370, operating 
in conjunction with the IBM Information Management System/Virtual Storage (IBM IMS/VS). System/360 and 
System/370 are respectively described in IBM publications GC 28-0661 and GC 20-1800. IBM IMS/VS is 
described in IBM publication SH20-9145-0. 

In Fig. 1b, there is shown a plurality of library text documents, usually stored in the secondary storage 
devices of the complex. Each library text document is processed by a Text Analysis Manager to generate a 
word list (called a Doc-List) that is added to a collection called DOC-DB. In generating a Doc-List, the Text 
Analysis Manager eliminates stop words and punctuation marks, and transforms every word into its 
canonical form. One implementation of DOC-DB is to have an inverted index structure defined on a 
vocabulary, as shown in Table 1, in which each word in the vocabulary is represented by a record 
comprising a set of identifiers (Doc-List_ID's) pointing to the Doc-Lists that contain the word, together with 
addresses of the positions (POS-i) in a Doc-List where the word occurs. 
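
The Doc-List generation and inverted index described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the stop-word list, the use of lowercasing as a stand-in for canonical form, and the names `make_doc_list` and `build_doc_db` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical stop-word list; the source does not enumerate one.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is"}


def make_doc_list(text):
    """Build a Doc-List: drop punctuation and stop words, lowercase the rest.

    Lowercasing is only a stand-in for the canonical-form transformation.
    """
    words = []
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")
        if word and word not in STOP_WORDS:
            words.append(word)
    return words


def build_doc_db(documents):
    """Map each vocabulary word to {Doc-List id: [1-based positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for pos, word in enumerate(make_doc_list(text), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index


docs = {672: "The search of text in a library.", 726: "Search and search again."}
doc_db = build_doc_db(docs)
print(doc_db["search"])
```

Each entry of `doc_db` plays the role of one vocabulary record of Table 1: a word, the Doc-Lists containing it, and the positions at which it occurs.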

A document retrieval process is initiated upon the receipt of a search command which comprises a 
sequence of query words. The search command may be entered directly by a user or indirectly from a 
user-friendly interface such as a dialogue system, a menu system, a natural language system, or an expert 
system. The query words (q1, ..., qm) are processed by the Text Analysis Manager to generate a word list, 
denoted as a Query-List, Q. 

Based upon a set of word similarity information, the Query-List is expanded by the Similarity Manager 
into a set of vectors QS(qj). The similarity information may be part of the search command. However, it is 
obvious that it may also be predefined and stored in the system. 

According to this invention, equivalent words refer to words between which an application has assigned 
an acceptable relevance. By way of example, an application may partition the vocabulary into equivalence 
classes where words of one class are defined as equivalent to each other; it may consider the number of 
basic editing operations (e.g. inserting or deleting a character) that are necessary to obtain one word from 
the other, without giving any significance to the meaning of the words; or it may give a measure of the 
closeness in meaning of each pair of words. 

In addition to defining a set of equivalent words (wjk's) for each query word (qj), the similarity 
information includes a word equivalence value (sjk) assigned to the corresponding equivalent word wjk. 

The Similarity Manager converts each word (qj) in the Query-List into a set of vectors, QS(qj). Each 
element of QS(qj) comprises the word identifier (QS_WORD) of a wjk and its corresponding word equivalence 
value (QS_SIMILARITY) sjk. The QS(qj)'s are concatenated to form an overall vector QS_VECT. QS_VECT 
has a pointer, QS_PTR[j], pointing to the offset of a QS(qj) (i.e. QS_PTR[j] = the index of the first 
element of QS(qj) in QS_VECT). A high-level language structure of QS_VECT is given in Table 2a. 
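
The concatenated layout of QS_VECT with its QS_PTR offsets can be sketched as follows. This is an illustrative Python sketch under assumptions: indices are 0-based, word identifiers are strings rather than integers, a trailing sentinel offset is appended to QS_PTR for convenient slicing, and the helper names `build_qs_vect` and `qs` are hypothetical.

```python
def build_qs_vect(query_list, equivalents):
    """Concatenate the sets QS(q_j) into one vector with per-word offsets.

    equivalents maps a query word to a list of (QS_WORD, QS_SIMILARITY) pairs.
    A query word without an entry is assumed equivalent only to itself.
    """
    qs_vect = []  # flat list of (QS_WORD, QS_SIMILARITY) elements
    qs_ptr = []   # qs_ptr[j] = index of the first element of QS(q_j)
    for q in query_list:
        qs_ptr.append(len(qs_vect))
        qs_vect.extend(equivalents.get(q, [(q, 1.0)]))
    qs_ptr.append(len(qs_vect))  # sentinel offset (an assumption of this sketch)
    return qs_vect, qs_ptr


def qs(j, qs_vect, qs_ptr):
    """Return the elements of QS(q_j) by slicing between two offsets."""
    return qs_vect[qs_ptr[j]:qs_ptr[j + 1]]


equivalents = {
    "text": [("text", 1.0), ("document", 0.7), ("string", 0.4)],
    "search": [("search", 1.0), ("retrieval", 0.8)],
}
qs_vect, qs_ptr = build_qs_vect(["text", "search"], equivalents)
print(qs_ptr, qs(1, qs_vect, qs_ptr))
```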

A search command also includes a set of matching criteria. There are three matching criteria 
implemented in this preferred embodiment. 

The first matching criterion, x, regulates the form in which a sequence of query words is mapped onto 
the words of a document (i.e. the type of inclusion). Three types of inclusion are recognized: 

. An unordered inclusion (x = 1) exists if a word sequence L1 can be mapped onto distinct words of a word 
sequence L2 with no constraint being put on the ordering of the words. As an example, an unordered 
inclusion of L1 on L2 exists if L1 is a word sequence CBD (each letter C, B, and D represents a word) and 
L2 is a word sequence ABCD. 

. An ordered inclusion (x = 2) exists if, in addition to satisfying the unordered inclusion, the order of the 
words in sequence L2 is the same as in L1 (e.g. L1 is a word sequence ACD and L2 is a word sequence 
ABCD). 

. A contiguous inclusion (x = 3) exists if, in addition to an ordered inclusion, all words in L2 mapped by L1 
are consecutive (e.g. L1 is a word sequence BCD and L2 is a word sequence ABCD). 
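
The three inclusion types can be checked mechanically. The following Python sketch is a simplified illustration that assumes exact word equality (it ignores equivalent words and equivalence values); the function names are hypothetical.

```python
from collections import Counter


def unordered_inclusion(l1, l2):
    """x = 1: every word of l1 maps onto a distinct word of l2, in any order."""
    need, have = Counter(l1), Counter(l2)
    return all(have[word] >= count for word, count in need.items())


def ordered_inclusion(l1, l2):
    """x = 2: l1 is a subsequence of l2 (same relative order)."""
    remaining = iter(l2)
    # 'word in remaining' advances the iterator, which enforces the ordering.
    return all(word in remaining for word in l1)


def contiguous_inclusion(l1, l2):
    """x = 3: l1 occurs in l2 as a run of consecutive words."""
    m = len(l1)
    return any(list(l2[i:i + m]) == list(l1) for i in range(len(l2) - m + 1))


# Each letter stands for one word, as in the examples above.
print(unordered_inclusion("CBD", "ABCD"),
      ordered_inclusion("ACD", "ABCD"),
      contiguous_inclusion("BCD", "ABCD"))
```

Each stricter type implies the weaker ones: CBD is only an unordered inclusion on ABCD, ACD is ordered but not contiguous, and BCD satisfies all three.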

The second matching criterion, y, concerns the relevance of spread in a sequence of document words 
that match a query word sequence. This criterion takes into account the word span of the sequence and is 
defined as the length of the sublist in L2 which satisfies a given inclusion of L1. For example, let L1' = 
ACD, L1'' = ABE, and L2 = ABCDE. There is an ordered inclusion of both L1' and L1'' on L2, but the spread 
of L1' is four words (ABCD) and the spread of L1'' is five words (ABCDE). In most situations a match is 
considered better when its spread is smaller. Thus, the inclusion of L1' on L2 is considered to be stronger 
than the inclusion of L1'' on L2. The spread requirement includes a maximum allowable span and a 
weighting of the relevance measure by the span of a matched sublist. 
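
The spread of an ordered inclusion can be computed directly. The sketch below is an illustration in Python, assuming exact word equality and taking the spread to be the length of the shortest stretch of L2 that contains L1 as an ordered subsequence; `min_spread` is a hypothetical helper name.

```python
def min_spread(l1, l2):
    """Length of the shortest stretch of l2 containing l1 in order, or None."""
    if not l1:
        return None
    best = None
    for start in range(len(l2)):
        if l2[start] != l1[0]:
            continue  # a minimal stretch must begin with the first word of l1
        matched = 0
        for end in range(start, len(l2)):
            if l2[end] == l1[matched]:
                matched += 1
                if matched == len(l1):
                    span = end - start + 1
                    if best is None or span < best:
                        best = span
                    break
    return best


print(min_spread("ACD", "ABCDE"), min_spread("ABE", "ABCDE"))
```

For the example above this yields a spread of four words for ACD and five words for ABE on ABCDE.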

The third matching criterion, z, relates to the relevance of the completeness of a match. If a complete 
match (z=) is specified, then only documents having a sequence of words which completely match the 
query word sequence are retrieved. Otherwise, for incomplete match (z=0), documents are retrieved even 
if the content of a document matches only a subset of the query words. The incomplete match requirement 
includes a minimum number of matched words and a weighting of the relevance measure by the actual 
number of matched words. 

Based upon these three matching criteria, a collection of similarity values Sxyz would exist. Each Sxyz 
represents the degree of similarity between the query word sequence and a corresponding word sequence 
which matches it in accordance with a given set of matching criteria, xyz. A list of all possible Sxyz is given 
in Table 3. 

Given two sequences of words, L1 = u1, ..., um and L2 = v1, ..., vn, and a word equivalence value e(ui, vj) 
between a word ui in L1 and a word vj in L2, the general formula for calculating a similarity value Sxyz is: 

$$S_{xyz}(L1, L2) = \max_{f_{xz}} \frac{1}{d(f_{xz})} \sum_{i=1}^{m} e\left(u_i, v_{f_{xz}(i)}\right)$$

where 

fxz(i) = position of the word v that matches the word ui 

d(fxz) = m, if y = 1 

= max (m, spread of fxz), if y = 2. 

However, an additional weighting factor g may be combined with e(ui, vj) to form s(ui, vj) so that 

$$S_{xyz}(L1, L2) = \max_{f_{xz}} \frac{1}{d(f_{xz})} \sum_{i=1}^{m} s\left(u_i, v_{f_{xz}(i)}\right)$$

g may represent a weighting factor depending on the position within the document of the matched word vj. 
The more detailed forms of Sxyz are listed in Table 4. 
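
For small inputs the general formula can be evaluated by brute force. The following Python sketch is only an illustration under stated assumptions: it enumerates mappings f that place all m query words on distinct document positions (words with zero equivalence simply contribute nothing), it takes the spread of f as the distance between the first and last mapped position plus one, and it uses a hypothetical `exact` equivalence function. It is exponential in cost and is not one of the algorithms of the preferred embodiment.

```python
from itertools import combinations, permutations


def similarity(l1, l2, e, x=2, weigh_spread=False):
    """Brute-force S(L1, L2) = max over f of (1/d(f)) * sum of e(u_i, v_f(i)).

    x selects the inclusion type: 1 unordered, 2 ordered, 3 contiguous.
    weigh_spread=False uses d(f) = m; True uses d(f) = max(m, spread of f).
    """
    m, n = len(l1), len(l2)
    if m == 0 or m > n:
        return 0.0
    if x == 1:
        mappings = permutations(range(n), m)      # any distinct positions
    elif x == 2:
        mappings = combinations(range(n), m)      # increasing positions
    else:
        mappings = (tuple(range(s, s + m)) for s in range(n - m + 1))
    best = 0.0
    for f in mappings:
        total = sum(e(l1[i], l2[f[i]]) for i in range(m))
        spread = max(f) - min(f) + 1
        d = max(m, spread) if weigh_spread else m
        best = max(best, total / d)
    return best


def exact(u, v):
    """Hypothetical equivalence function: 1.0 for identical words, else 0.0."""
    return 1.0 if u == v else 0.0


print(similarity("ACD", "ABCDE", exact, x=2))
print(similarity("ACD", "ABCDE", exact, x=2, weigh_spread=True))
```

With ACD mapped onto ABCDE, the ordered similarity is 1.0 when spread is ignored and 0.75 when it is weighed, because the three matched words span four document positions.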

The sequence of expanded query word sets QS(qj), together with a given set of matching criteria, is 
received by the Selection Manager to select those documents in DOC-DB that are potentially relevant. 
However, only a subset of documents from DOC-DB may be received by the Selection Manager. This 
subset of documents is preselected from DOC-DB based upon optional input information (e.g. all 
documents stored before the year 1970). 

An assumption is made that a document is relevant if it contains at least one word similar to a query 
word. For each potentially relevant Doc-List, the Selection Manager derives a position vector P representing 
a sequence of positions p1, ..., pn of those Doc-List words that have a non-zero similarity value with some 
query words. pi thus denotes the position of the i-th Doc-List word that has a non-zero similarity with some 
query words. 

The Selection Manager also generates a Document Query Matrix (DQ-matrix). Each element (dqij) in the 
DQ-matrix contains the similarity value (sjk) between the pi-th Doc-List word (dpi) and the j-th query word 
(qj). Every row of the DQ-matrix would have at least one non-zero element since, if this were not true, the 
Doc-List word dpi would have null similarity with all the query words and would not have been selected by 
the Selection Manager. On the other hand, if each row of the DQ-matrix has at most one non-zero element, 
the matrix is referred to as separable; otherwise the matrix is referred to as general. Moreover, if the 
DQ-matrix has only binary values, the DQ-matrix is referred to as binary; otherwise the matrix is referred to as 
continuous. 
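
The construction of the P-vector and DQ-matrix, and the separable/binary classification, can be sketched in Python as follows. The dense list-of-lists matrix, the dictionary form of the equivalence information, and the function names are assumptions of this illustration.

```python
def build_p_and_dq(doc_list, query_list, equivalents):
    """Return (P, DQ) for one Doc-List.

    equivalents maps a query word to {equivalent word: equivalence value}.
    P holds the 1-based positions of Doc-List words similar to some query word;
    DQ[i][j] is the equivalence value between that word and the j-th query word.
    """
    p, dq = [], []
    for pos, word in enumerate(doc_list, start=1):
        row = [equivalents.get(q, {}).get(word, 0.0) for q in query_list]
        if any(value != 0 for value in row):
            p.append(pos)
            dq.append(row)
    return p, dq


def is_separable(dq):
    """True if every row has at most one non-zero element."""
    return all(sum(1 for value in row if value != 0) <= 1 for row in dq)


def is_binary(dq):
    """True if every element is 0 or 1."""
    return all(value in (0, 1) for row in dq for value in row)


equivalents = {
    "search": {"search": 1.0, "retrieval": 0.8},
    "text": {"text": 1.0, "document": 0.7},
}
p, dq = build_p_and_dq(["fast", "retrieval", "library", "document"],
                       ["search", "text"], equivalents)
print(p, dq, is_separable(dq), is_binary(dq))
```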

The above described selection procedure performed by the Selection Manager can be summarized in 
the pseudo code list in Table 5. 

One way of implementing the DQ-matrix is by storing non-zero elements of each row along with their 
respective positions within the row. An element representing a non-zero element in the i-th row of the 
DQ-matrix would contain an index and the element dqij. A vector DQ_VECT is formed by concatenating the 
rows of the DQ-matrix. A high level language representation of the DQ-matrix is given in Table 2b. 
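
The row-wise sparse storage just described can be sketched in Python. This is an illustration with 0-based indices; the sentinel entry appended to the row pointers and the helper names `pack_dq` and `lookup` are assumptions rather than the structures of Table 2b.

```python
def pack_dq(dq):
    """Store only non-zero elements, row by row, as (DQ_INDEX, DQ_VALUE) pairs."""
    dq_vect, dq_ptr = [], []
    for row in dq:
        dq_ptr.append(len(dq_vect))  # offset of the first element of this row
        dq_vect.extend((j, value) for j, value in enumerate(row) if value != 0)
    dq_ptr.append(len(dq_vect))      # sentinel so that row i ends at dq_ptr[i + 1]
    return dq_vect, dq_ptr


def lookup(i, j, dq_vect, dq_ptr):
    """Return dq[i][j] by scanning the stored elements of row i, or 0.0."""
    for index, value in dq_vect[dq_ptr[i]:dq_ptr[i + 1]]:
        if index == j:
            return value
    return 0.0


dq_vect, dq_ptr = pack_dq([[0.8, 0.0], [0.0, 0.7], [0.5, 0.9]])
print(dq_ptr, lookup(2, 1, dq_vect, dq_ptr), lookup(0, 1, dq_vect, dq_ptr))
```

The layout trades constant-time element access for compactness, which suits a DQ-matrix in which most Doc-List words are similar to only one or two query words.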

The P-vector and the DQ-matrix are received by the Matching Algorithm Manager to select a matching 
algorithm to solve a particular Sxyz. To improve efficiency, each Sxyz is further sub-divided, according 
to whether the DQ-matrix is separable or general, and whether it is binary or continuous, into four 
subcombinations. The four subcombinations are denoted as: 

A^xyz = separable and binary 
A^xyz = separable and continuous 
A^xyz = general and binary 
A^'xyz = general and continuous 

Before describing the algorithms for solving the similarity values, a general method is described herein 
for translating an algorithm which is used for solving similarity values Sx1z having a no-spread criterion, to 
an algorithm for Sx2z wherein spread consideration is required (other criteria being identical): 

Given an algorithm Ax1z where spread is not required, the similarity value Sx2z for the analogous 
matching criteria x2z in which spread is required can be obtained by executing Ax1z for each substring of 
the Doc-List and weighing the optimal match by the spread. The optimal weighted matching value would 
form the expected result. In other words, Ax2z = R(Ax1z), where R denotes the reduction algorithm. An 
implementation of R is listed in Table 6. 
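
The reduction idea (run a no-spread algorithm on every substring of the Doc-List and weigh the result by the spread) can be sketched in Python. This is an illustration under assumptions, not a transcription of Table 6: the no-spread helper returns an unnormalized sum together with the matched document positions, the weighting divides by max(m, spread) as in the general formula, and `best_sum_separable` is a hypothetical helper valid for a separable DQ-matrix only.

```python
def best_sum_separable(p_sub, dq_sub):
    """No-spread helper: for each query word pick its best word in the substring.

    Returns (sum of chosen values, document positions of the chosen words).
    """
    m = len(dq_sub[0])
    total, positions = 0.0, []
    for j in range(m):
        value, pos = max((row[j], pos) for row, pos in zip(dq_sub, p_sub))
        if value > 0:
            total += value
            positions.append(pos)
    return total, positions


def reduce_with_spread(no_spread_algo, p, dq):
    """Apply no_spread_algo to every substring and weigh each result by spread."""
    m = len(dq[0])
    best = 0.0
    for i in range(len(p)):
        for l in range(i, len(p)):
            total, positions = no_spread_algo(p[i:l + 1], dq[i:l + 1])
            if not positions:
                continue
            spread = max(positions) - min(positions) + 1
            best = max(best, total / max(m, spread))
    return best


p = [2, 3, 10]
dq = [[1.0, 0.0], [0.0, 0.6], [0.0, 1.0]]
print(reduce_with_spread(best_sum_separable, p, dq))
```

In this example the weaker but adjacent match at position 3 wins over the perfect match at position 10, because the latter is spread over nine document positions.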

The different algorithms for solving similarity values under a particular set of matching criteria are now 
described. It is understood that although these algorithms are used in this preferred embodiment, other 
algorithms may similarly be applied. 

For similarity value S11z in which the DQ-matrix is separable, the equation can be reduced to: 

$$S_{11z} = \frac{1}{m} \sum_{j=1}^{m} \max_{1 \le i \le n} dq_{ij}$$

For matching criteria 111 in which the DQ-matrix is not separable, the problem is exactly a Weighted 
Bipartite Matching Problem and can be solved by the Hungarian Method described in Lawler E.L., 
"Combinatorial Optimization: Networks and Matroids", Holt, Rinehart and Winston, 1976. A formulation of 
the weighted bipartite matching problem is: given a complete bipartite graph G = (S, T, S x T), |S| = n, 
|T| = m and a weight wij >= 0 for each edge (si, tj) in S x T, 

$$\text{maximize } \sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} x_{ij} \quad \text{subject to} \quad \sum_{j=1}^{m} x_{ij} \le 1,\ 1 \le i \le n; \quad \sum_{i=1}^{n} x_{ij} \le 1,\ 1 \le j \le m; \quad x_{ij} \ge 0$$

The Hungarian Method can be applied by defining S and T as indices of elements in the Doc-List and the 
expanded query word sets, respectively, and for each edge (i,j) in S x T, by letting dqij be its weight, i.e. wij 
= dqij. 
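
To make the matching formulation concrete, the following Python sketch solves a tiny instance by exhaustive enumeration. It deliberately does not implement the Hungarian Method; brute force is swapped in here only because it is short and easy to verify, and it is practical only for very small matrices. The name `max_weight_matching` is hypothetical.

```python
from itertools import permutations


def max_weight_matching(dq):
    """Maximum total weight when each row and each column is used at most once.

    Because all weights are non-negative, it suffices to enumerate injective
    assignments of the smaller side onto the larger side.
    """
    n, m = len(dq), len(dq[0])
    best = 0.0
    if n >= m:
        for rows in permutations(range(n), m):   # rows[j] is matched to column j
            best = max(best, sum(dq[rows[j]][j] for j in range(m)))
    else:
        for cols in permutations(range(m), n):   # cols[i] is matched to row i
            best = max(best, sum(dq[i][cols[i]] for i in range(n)))
    return best


dq = [[0.9, 0.8], [0.7, 0.0], [0.0, 0.6]]
print(max_weight_matching(dq))
```

Taking the best value of each column independently would use the first Doc-List word for both query words (0.9 + 0.8), which the constraints forbid; the best valid matching has total weight 1.5, giving a similarity of 1.5 / m = 0.75.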

For S112 with no separability, the Hungarian Method is also used. However, it is necessary to avoid the 
case in which the maximum match is an incomplete match. In other words, it is necessary to enforce a 
situation that, if there is at least one complete match, then the maximum match is a complete match. To 
apply the Hungarian Method, complete matches must be separated from incomplete matches, such that the 
range Icompl of matching values for complete matches is disjoint from the range Iincompl of matching values 
for incomplete matches. Furthermore, the maximum of interval Iincompl must be less than the minimum 
of interval Icompl. This is expressed by the condition that Icompl must be greater than Iincompl, i.e. Min Icompl > 
Max Iincompl. Thus, if there exists at least one complete match, the Hungarian Method will find it. If there is 
none, an incomplete match will be found. 

To distinguish between the matching values of complete and incomplete matches, new weights w'ij are 
derived from the old weights wij by adding the constant value m, i.e. 

w'ij := wij + m, 1 <= i <= n, 1 <= j <= m. 

This modification preserves the distances between any pair of weights, i.e. 
w'ij - w'kl = wij - wkl, 1 <= i,k <= n, 1 <= j,l <= m. 

Furthermore, this modification satisfies the previously identified condition, because 

Max I'incompl = (m-1) Max w'ij 
Min I'compl = n Min w'ij, w'ij > 0 

To evaluate S122 where the DQ-matrix is separable, two variables MATCH and WEIGHT are defined for 
storing the actual matching and the weights, i.e., if the j-th Expanded Query word is matched with the i-th 
Doc-List word, then MATCH[j] = i and WEIGHT[j] = dqij. 

A high level language description of the algorithm is given in Table 7. The function Find_assign(i), 
used in the algorithm of Table 7 and in algorithms described hereafter, returns the index j of the Expanded 
Query word which the i-th Doc-List word can be matched to. 

The function L(i,j) returns the similarity value dqij of the i-th element in the Doc-List and the j-th word of 
the query. It searches in the index interval from DQ_PTR[i] to DQ_PTR[i+1]-1 of DQ_VECT for an element 
with DQ_INDEX = j and returns the similarity value DQ_VALUE, i.e. dqij. If there is no such element, then 
the value of zero is returned. 

Algorithm A122 for S122 where the DQ-matrix is separable can simply be derived from algorithm A121. 

Algorithms A12z can be derived from algorithms A11z by applying the reduction algorithm R in Table 6. 

S211 can be solved by formulating it as a Longest Common Subsequence (LCS) problem as 
follows: given two sequences s1 = a1, ..., an and s2 = b1, ..., bm where n >= m, find an LCS t that satisfies the 
conditions that (1) t is a subsequence of s1 and s2, and (2) t is maximal, i.e. there is no other sequence t' that 
satisfies (1) and length(t') > length(t). The LCS problem is well known in the literature and algorithms for solving it 
have been known; see, for instance, A.V. Aho et al., "Data Structures and Algorithms", Addison-Wesley, 
(1983). 
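
The LCS formulation can be illustrated with the standard dynamic-programming recurrence. The Python sketch below assumes exact word equality, as in the binary case; it is the textbook algorithm and not a transcription of any table in this document.

```python
def lcs_length(s1, s2):
    """Length of a longest common subsequence of s1 and s2."""
    n, m = len(s1), len(s2)
    # table[i][j] = LCS length of the prefixes s1[:i] and s2[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


doc, query = "ABCDE", "ACD"
print(lcs_length(doc, query), lcs_length(doc, query) == len(query))
```

Here the whole query ACD is found in order within ABCDE, so the LCS length equals m, which is the condition for a complete ordered match.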

For S211 where the DQ-matrix is not binary, the algorithm A211 can be reduced to the problem of 
finding the shortest path in an acyclic digraph which can be obtained from the DQ-matrix using the 
algorithm shown in Table 8. 

Algorithm DAG uses a function Next(i,j) that accesses the index interval DQ_PTR[i] to DQ_PTR[i+1]-1 
in DQ_VECT of the i-th row of the DQ-matrix and returns the smallest DQ_INDEX greater than j. If no 
such index exists, the function returns 0. This function uses, for each row of DQ, a pointer to the current 
element of the row in DQ_VECT. This provides efficient sequential access to all the elements of each 
row of the DQ-matrix. 

For the evaluation of S212 where the DQ-matrix is binary, an additional test is performed to check that 
all the query words have been matched, i.e. the length |t| of the LCS is m. 

For S212, the algorithm is similar to A211 for the complete case. However, if a complete matching 
exists, the algorithm must return the maximum complete matching. To achieve this, the problem can be 
reduced to the Shortest Path Problem; the interval Iincompl of matching values for the incomplete matches 
and the interval Icompl of matching values for complete matches must be disjoint, i.e. satisfy the previously 
identified condition. The necessary modification is analogous to that applied in algorithm A112. 

For similarity values S22z, the algorithms are obtained from the algorithms A21z through the reduction 
algorithm R. 

For the contiguous cases S3yz, it is not necessary to distinguish between the binary and continuous 
cases. The basic idea of all four algorithms is to scan a window over the Doc-List. In each iteration, the 
matching value V of the actual window and the query is computed. 

Algorithm A^^'an and algorithm A*''*3it are respectively the same as algorithm A**'3i2 and algorithm 
A«*3i2. The difference between the algorithms for S311 and S312 is that in the complete case, only windows 
of width m are considered, whereas in the incomplete case partial windows at the beginning and the end of 
the Doc-List must be checked. Thus, the algorithms for the incomplete case are a particular extension of 
the algorithms for the complete case. 

To solve S321, a window of width m is scanned over the Doc-List, starting with the first position and 
stopping at the (n-m)-th position. In each of the (n-m) iterations, the matching value V is computed using 
function L(i,j). A high level language description of the algorithm is given in Table 9. 
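
The window scan for the contiguous, complete case can be sketched in Python. This is an illustration with 0-based indices and a dense DQ-matrix; the explicit check that the P-vector positions inside a window are consecutive is an assumption added here to enforce contiguity in the original document, and `contiguous_complete` is a hypothetical name.

```python
def contiguous_complete(p, dq):
    """Slide a window of width m over the rows of DQ.

    Row start + j of the window is compared with query word j. A window counts
    only if every query word is matched and the document positions are
    consecutive. Returns (best matching value, start row), or (0.0, None).
    """
    n, m = len(dq), len(dq[0])
    best_value, best_start = 0.0, None
    for start in range(n - m + 1):
        values = [dq[start + j][j] for j in range(m)]
        complete = all(value != 0 for value in values)
        adjacent = all(p[start + j] == p[start] + j for j in range(m))
        if complete and adjacent and sum(values) > best_value:
            best_value, best_start = sum(values), start
    return best_value, best_start


p = [1, 2, 3, 4]
dq = [[0.5, 0.0, 0.0],
      [1.0, 0.0, 0.0],
      [0.0, 0.9, 0.0],
      [0.0, 0.0, 0.8]]
print(contiguous_complete(p, dq))
```

Only the window starting at the second row matches all three query words in consecutive positions, so it is returned with a matching value of about 2.7.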

The algorithm listed in Table 10 consists of three phases. In the first phase, partial windows at the beginning 
of the Doc-List are considered. In the second phase, total windows of width m in the center are considered, and in the 
third phase, partial windows at the end of the Doc-List are considered. The array M' is of the same type as 
array M, and it is used for storing the optimal matching in each phase. A high level language description of 
the algorithm is given in Table 10. 

Although an exemplary embodiment of the invention has been disclosed for purposes of illustration, it 
will be understood that various changes, modifications and substitutions may be incorporated into such 
embodiment without departing from the spirit of the invention as defined by the claims appearing 
hereinafter. 

Table 1 

Structure of the Vocabulary 

Word      | DOC-List ID | POS          | DOC-List ID | POS          | ...
"search"  | doc 672     | 10, 324, ... | doc 726     | 1, 441, ...  | ...   (record m)
"text"    | doc 411     | 25, 121, ... | doc 672     | 13, 51, ...  | ...   (record n)

Table 2a 

QS_ELEMENT : record QS_WORD : integer; 
                    QS_SIMILARITY : real; 
             end; 
QS_VECT : array [1..maxindex] of QS_ELEMENT; 
QS_PTR : array [1..m] of integer; 

Table 2b 

DQ_ELEMENT : record DQ_INDEX : integer; 
                    DQ_VALUE : real; 
             end; 
DQ_VECT : array [1..maxindex] of DQ_ELEMENT; 
DQ_PTR : array [1..n] of integer; 

Table 3 

             | Incomplete/Complete 
             | Spread tolerant | Weighted by spread 
Unordered    |                 | 
Ordered      |                 | 
Contiguous   |                 | 

Table 4 

Afar (fR, Spread of /J,) 

: a Afar {ot. Spread of ^} 



Table 5 

procedure selection; 
  for all words in Expanded Query do 
    retrieve index record; 
  for each Doc-List containing at least one word of the Expanded Query do 
    create P-vector, DQ-matrix; 
    if complete matching is desired and DQ* has some zero row 
      then discard Doc-List; 
end; 

Table 6 

algorithm R(Ax1z (n,m,i,l,P,DQ:M,V)); 
V := 0; 
for i := 1 to n do 
begin 
  for l := (i+1) to n do 
  begin 
    compute spread D of M'; 
    if V'/D > V then begin M := M'; V := V'/D; end; 
  end; 
end; 
end. 



Table 7 



algorithm A121 (n,m,P,DQ:M,V); 
begin 

M := 0; V := 0; 

(* Iteration for the start positions of intervals *) 
for START := 1 to n do 
begin 

MATCH := 0; WEIGHT := 0; 
SUM_WEIGHT := 0; 
FIRST_CANCEL := false; 

(* Iteration for the ending positions of the interval *) 
for fc -START to n do (niin(START+ KS . n)) 
begin 

Find.as(t9n<i}; 

/r L(i.D > WEJGfrrij] 

tlien begin 

(• Exchanging of the matching •) 

(M^TCHW - STARTJ: 

WEIGHTUi m.]); 
* en^ 

(• Cempariaort of the actual matching MATCH with the 
currently optimal matching M •> 

ifv < waG>rr,suM/(P(ST>i/?7i-ff,i+ii 

and noiCFlRSTjCANCEU) 
thon btgin 

(• Better glottal matching •) 

ei^ WQCHT^SUM/(PlS7>;WT]-Ptil+t); 

•ndt 
end. 
Table 8 



A 0; 

for «»ch non-zero aqu do 

min.r-m; 
for tadi nod* kif do 
/«rn-l+1 ton do 
b*gio 

0* <>0 tboo 

wbSo s S min^A and s <>0 do 
bogia 

«:-Nexi(r,$): 

•f»d. 



Table 9 

algorithm A321 (n,m,P,DQ:M,V); 
V := 0; 
COMPLETE := true; 
for i := 1 to n-m do 
  S := 0; 
  for j := 1 to m do 
  begin 
    S := S + L(i+j-1, j); 
    COMPLETE := COMPLETE and (L(i+j-1, j) <> 0); 
  end; 
  if S > V and COMPLETE then 
  begin V := S; 
    START := i; 
  end; 
for j := 1 to m do 
  M[j] := P[START+j]; 

Table 10 

begin 
  V := 0; 

  (* Phase 1 *) 
  for i := 1 to m-1 do 
    S := 0; M' := 0; 
    for j := m-i+1 to m do 
      S := S + L(j-m+i, j); 
      if L(j-m+i, j) <> 0 then 
        M'[j] := P[j-m+i]; 
    end; 
    if S > V then begin V := S; M := M'; end; 

  (* Phase 2 *) 
  for i := 1 to n-m+1 do 
    S := 0; M' := 0; 
    for j := 1 to m do 
    begin 
      S := S + L(i+j-1, j); 
      if S > V then begin V := S; M := M'; end; 
    end; 

  (* Phase 3 *) 
  for i := n-m+2 to n do 
  begin 
    S := 0; 
    for j := 1 to n-i+1 do 
    begin 
      S := S + L(i+j-1, j); 
      M'[j] := P[j+i-1]; 
    end; 
    if S > V then begin V := S; M := M'; end; 
  end; 
end. 



Claims 

1. A method implemented in a data processing apparatus for retrieving from among more than one 
library document those matching the content of a sequence of query words, comprising the steps of: 

(a) defining a set of equivalent words for each of the query words and assigning a word equivalence 
value to each of said equivalent words; and 

(b) computing a relevance factor for a library document, comprising the steps of: 

(i) locating target sequences of words in the library document that match the sequence of query words, and 
equivalents thereof, according to a set of matching criteria; 

(ii) evaluating similarity values of said target sequences of words, each similarity value being evaluated as a 
function of the equivalence values of words included in the corresponding target sequence; and 

(iii) said relevance factor being computed as a function of the similarity values of its target sequences. 

2. A method according to claim 1 wherein said function of the equivalence values in step (ii) includes a 
weighting factor based upon positions in said document of words in said target sequence. 

3. A method according to claim 1 wherein said matching criteria include the ordering of a sequence of 
words in the library document with respect to the sequence of query words. 

4. A method according to claim 1 wherein said matching criteria include the completeness of a match 
between a sequence of words in the library document and the sequence of query words. 

5. A method according to claim 1 wherein said matching criteria include the span of a sequence of 
words in the library document that matches the sequence of query words. 

6. A method according to claim 1 comprising the additional step of preselecting a subset of library 
documents and computing the relevance factors only for said subset of library documents, and ranking 
library documents in order of their respective relevance factors. 

7. A method according to claim 6 wherein said relevance factor of a library document is the similarity 
value of a target sequence having the highest similarity value. 

8. A method according to claim 7, wherein a similarity value Sxyz is evaluated according to the 
equation: 

$$S_{xyz}(L1, L2) = \max_{f_{xz}} \frac{1}{d(f_{xz})} \sum_{i=1}^{m} s\left(u_i, v_{f_{xz}(i)}\right)$$

where L1 represents the target sequence; 

L2 represents the sequence of query words; 

x represents said ordering matching criterion; 

y represents said completeness matching criterion; 

z represents said span matching criterion; and 

d(fxz) = m if y = 1, and 

= max (m, spread of fxz) if y = 2. 

9. A document retrieval system storing more than one library document, an apparatus for retrieving 
library documents matching the content of a sequence of query words, comprising: 

(a) means for storing a set of equivalent words for each of the query words, each of said equivalent 
words being stored with a corresponding word equivalence value; and 

(b) means coupled to said storage means for computing a relevance factor for a library document, 
comprising: 

(i) first means for receiving a set of matching criteria; 

(ii) second means coupled to said first means for locating target sequences of words in a library document 
that match the sequence of query words, and equivalents thereof, according to said matching criteria; 

(iii) third means coupled to said second means for evaluating similarity values of said target sequences of 
words, each similarity value being evaluated as a function of the equivalence values of words included in 
the corresponding target sequence; and 

(iv) fourth means receiving said similarity values for computing said relevance factor. 

10. Apparatus as in claim 9 wherein said function in said third means includes a weighting factor based 
upon positions in said document of words in said target sequence. 